Ryusuke MIYAMOTO Hiroki SUGANO
Nowadays, pedestrian recognition for automotive and security applications that require accurate recognition in images taken from distant observation points is a recent challenging problem in the field of computer vision. To achieve accurate recognition, both detection and tracking must be precise. For detection, some excellent schemes suitable for pedestrian recognition from distant observation points are proposed, however, no tracking schemes can achieve sufficient performance. To construct an accurate tracking scheme suitable for pedestrian recognition from distant observation points, we propose a novel pedestrian tracking scheme using multiple cues: HSV histograms and HOG features. Experimental results show that the proposed scheme can properly track a target pedestrian where tracking schemes using only a single cue fails. Moreover, we implement the proposed scheme on NVIDIA® TeslaTM C1060 processor, one of the latest GPU, to achieve real-time processing of the proposed scheme. Experimental results show that computation time required for tracking of a frame by our implementation is reduced to 8.80 ms even though Intel® CoreTM i7 CPU 975 @ 3.33 GHz spends 111 ms.
Je-Hoon LEE Sang-Choon KIM Young-Jun SONG
This paper presents a high-speed SHA-1 implementation. Unlike the conventional unfolding transformation, the proposed unfolding transformation technique makes the combined hash operation blocks to have almost the same delay overhead regardless of the unfolding factor. It can achieve high throughput of SHA-1 implementation by avoiding the performance degradation caused by the first hash computation. We demonstrate the proposed SHA-1 architecture on a FPGA chip. From the experimental results, the SHA-1 architecture with unfolding factor 5 shows 1.17 Gbps. The proposed SHA-1 architecture can achieve about 31% performance improvements compared to its counterparts. Thus, the proposed SHA-1 can be applicable for the security of the high-speed but compact mobile appliances.
Woong-Kee LOH Yang-Sae MOON Wookey LEE
Since the release of human genome sequences, one of the most important research issues is about indexing the genome sequences, and the suffix tree is most widely adopted for that purpose. The traditional suffix tree construction algorithms suffer from severe performance degradation due to the memory bottleneck problem. The recent disk-based algorithms also provide limited performance improvement due to random disk accesses. Moreover, they do not fully utilize the recent CPUs with multiple cores. In this paper, we propose a fast algorithm based on `divide-and-conquer' strategy for indexing the human genome sequences. Our algorithm nearly eliminates random disk accesses by accessing the disk in the unit of contiguous chunks. In addition, our algorithm fully utilizes the multi-core CPUs by dividing the genome sequences into multiple partitions and then assigning each partition to a different core for parallel processing. Experimental results show that our algorithm outperforms the previous fastest DIGEST algorithm by up to 10.5 times.
Amedeo CAPOZZOLI Claudio CURCIO Antonio DI VICO Angelo LISENO
We develop an effective algorithm, based on the filtered backprojection (FBP) approach, for the imaging of vegetation. Under the FBP scheme, the reconstruction amounts at a non-trivial Fourier inversion, since the data are Fourier samples arranged on a non-Cartesian grid. The computational issue is efficiently tackled by Non-Uniform Fast Fourier Transforms (NUFFTs), whose complexity grows asymptotically as that of a standard FFT. Furthermore, significant speed-ups, as compared to fast CPU implementations, are obtained by a parallel versions of the NUFFT algorithm, purposely designed to be run on Graphic Processing Units (GPUs) by using the CUDA language. The performance of the parallel algorithm has been assessed in comparison to a CPU-multicore accelerated, Matlab implementation of the same routine, to other CPU-multicore accelerated implementations based on standard FFT and employing linear, cubic, spline and sinc interpolations and to a different, parallel algorithm exploiting a parallel linear interpolation stage. The proposed approach has resulted the most computationally convenient. Furthermore, an indoor, polarimetric experimental setup is developed, capable to isolate and introduce, one at a time, different non-idealities of a real acquisition, as the sources (wind, rain) of temporal decorrelation. Experimental far-field polarimetric measurements on a thuja plicata (western redcedar) tree point out the performance of the set up algorithm, its robustness against data truncation and temporal decorrelation as well as the possibility of discriminating scatterers with different features within the investigated scene.
The proposed scheduling scheme minimizes the mean energy consumption of a real-time parallel task, where the task has the probabilistic computation amount and can be executed concurrently on multiple cores. The scheme determines a pertinent number of cores allocated to the task execution and the instant frequency supplied to the allocated cores. Evaluation shows that the scheme saves manifest amount of the energy consumed by the previous method minimizing the mean energy consumption on a single core.
This research addresses a high-speed computation method for the Kleene star of the weighted adjacency matrix in a max-plus algebraic system. We focus on systems whose precedence constraints are represented by a directed acyclic graph and implement it on a Cell Broadband EngineTM (CBE) processor. Since the resulting matrix gives the longest travel times between two adjacent nodes, it is often utilized in scheduling problem solvers for a class of discrete event systems. This research, in particular, attempts to achieve a speedup by using two approaches: parallelization and SIMDization (Single Instruction, Multiple Data), both of which can be accomplished by a CBE processor. The former refers to a parallel computation using multiple cores, while the latter is a method whereby multiple elements are computed by a single instruction. Using the implementation on a Sony PlayStation 3TM equipped with a CBE processor, we found that the SIMDization is effective regardless of the system's size and the number of processor cores used. We also found that the scalability of using multiple cores is remarkable especially for systems with a large number of nodes. In a numerical experiment where the number of nodes is 2000, we achieved a speedup of 20 times compared with the method without the above techniques.
Yuji OKAZAKI Takanori UNO Hideki ASAI
In this paper, we propose an optimization system with parallel processing for reducing electromagnetic interference (EMI) on electronic control unit (ECU). We adopt simulated annealing (SA), genetic algorithm (GA) and taboo search (TS) to seek optimal solutions, and a Spice-like circuit simulator to analyze common-mode current. Therefore, the proposed system can determine the adequate combinations of the parasitic inductance and capacitance values on printed circuit board (PCB) efficiently and practically, to reduce EMI caused by the common-mode current. Finally, we apply the proposed system to an example circuit to verify the validity and efficiency of the system.
Junichi AKITA Hiroaki TAKAGI Keisuke DOUMAE Akio KITAGAWA Masashi TODA Takeshi NAGASAKI Toshio KAWASHIMA
Although the line-of-sight (LoS) is expected to be useful as input methodology for computer systems, the application area of the conventional LoS detection system composed of video camera and image processor is restricted in the specialized area, such as academic research, due to its large size and high cost. There is a rapid eye motion, so called 'saccade' in our eye motion, which is expected to be useful for various applications. Because of the saccade's very high speed, it is impossible to track the saccade without using high speed camera. The authors have been proposing the high speed vision chip for LoS detection including saccade based on the pixel parallel processing architecture, however, its resolution is very low for the large size of its pixel. In this paper, we propose and discuss an architecture of the vision chip for LoS detection including saccade based on column-parallel processing manner for increasing the resolution with keeping high processing speed.
This paper presents media processor architectures for automotive applications. Media processing applications with their requirements for LSI implementations are first described for vision based driver assistance as well as graphical user interface for car navigation using 3D graphics. Then, parallel processing architectures for vision and graphics in these applications are reviewed with their performance and cost. After that, future trends of automotive media processing such as integration of vision and 3D graphics functions are shown with their applications and the required performance. Moreover, parallel processing architectures are discussed for the integration of vision and graphics. Finally, an prospect of a next-generation media processing LSI for automotives is provided.
Recently, research on parallel processing systems is very active, and many complex topologies have been proposed. A burnt pancake graph is one such topology. In this paper, we prove that a faulty burnt pancake graph with degree n has a fault-free Hamiltonian cycle if the number of the faulty elements is n-2 or less, and it has a fault-free Hamiltonian path between any pair of nonfaulty nodes if the number of the faulty elements is n-3 or less.
Takeshi KUMAKI Yasuto KURODA Masakatsu ISHIZAKI Tetsushi KOIDE Hans Jurgen MATTAUSCH Hideyuki NODA Katsumi DOSAKA Kazutami ARIMOTO Kazunori SAITO
This paper presents a novel optimized real-time Huffman encoder using a pipelined data path based on CAM technology and a parallel code-word-table optimizer. The exploitation of CAM technology enables fast parallel search of the code word table. At the same time, the code word table is optimized according to the frequency of received input symbols and is up-dated in real-time. Since these two functions work in parallel, the proposed architecture realizes fast parallel encoding and keeps a constantly high compression ratio. Evaluation results for the JPEG application show that the proposed architecture can achieve up to 28% smaller encoded picture sizes than the conventional architectures. The obtained encoding time can be reduced by 95% in comparison to a conventional SRAM-based architecture, which is suitable even for the latest end-user-devices requiring fast frame-rates. Furthermore, the proposed architecture provides the only encoder that can simultaneously realize small compressed data size and fast processing speed.
Takeshi KUMAKI Yutaka KONO Masakatsu ISHIZAKI Tetsushi KOIDE Hans Jurgen MATTAUSCH
This paper presents a scalable FPGA/ASIC implementation architecture for high-speed parallel table-lookup-coding using multi-ported content addressable memory, aiming at facilitating effective table-lookup-coding solutions. The multi-ported CAM adopts a Flexible Multi-ported Content Addressable Memory (FMCAM) technology, which represents an effective parallel processing architecture and was previously reported in [1]. To achieve a high-speed parallel table-lookup-coding solution, FMCAM is improved by additional schemes for a single search mode and counting value setting mode, so that it permits fast parallel table-lookup-coding operations. Evaluation results for Huffman encoding within the JPEG application show that a synthesized semi-custom ASIC implementation of the proposed architecture can already reduce the required clock-cycle number by 93% in comparison to a conventional DSP. Furthermore, the performance per area unit, measured in MOPS/mm2, can be improved by a factor of 3.8 in comparison to parallel operated DSPs. Consequently, the proposed architecture is very suitable for FPGA/ASIC implementation, and is a promising solution for small area integrated realization of real-time table-lookup-coding applications.
Toru SHIMIZU Masami NAKAJIMA Masahiro KAINAGA
This paper describes the design and evaluation of a massively parallel processor base on Matrix architecture which is suitable for portable multimedia applications. The proposed architecture in this paper achieves 40 GOPS of 16-bit fixed-point additions at 200 MHz clock frequency and 250 mW power dissipation. In addition, 1 M-bit SRAM for data registers and 2,048 2-bit processing elements connected by a flexible switching network are integrated in 3.1 mm2 in 90 nm low-power CMOS technology. The energy-efficient Matrix architecture supports 2,048-way parallel operations and the programmable functions required for multimedia SoCs.
Junichi MIYAKOSHI Yuichiro MURACHI Tomokazu ISHIHARA Hiroshi KAWAGUCHI Masahiko YOSHIMOTO
For super-parallel video processing, we proposed a power- and area-efficient SRAM core architecture with a segmentation-free access, which means accessibility to arbitrary consecutive pixels, and horizontal/vertical access. To achieve these flexible accesses, a spirally-connected local-wordline select signal and multi-selection scheme in wordlines are proposed, so that extra X-decoders in the conventional multi-division SRAM can be eliminated. Consequently, the proposed SRAM reduces a power and area by 57-60% and 60%, respectively, when it is applied to a 128 parallel architecture. The proposed 160-kbit SRAM with 16-read ports (2-read port SRAM with eight-parallel architecture) is implemented to a search window buffer for an H.264 motion estimation processor core which dissipates 800 µW for QCIF 15-fps in a 130-nm technology.
Junichi AKITA Hiroaki TAKAGI Takeshi NAGASAKI Masashi TODA Toshio KAWASHIMA Akio KITAGAWA
Rapid eye motion, or so called saccade, is a very quick eye motion which always occurs regardless of our intention. Although the line of sight (LOS) with saccade tracking is expected to be used for a new type of computer-human interface, it is impossible to track it using the conventional video camera, because of its speed which is often up to 600 degrees per second. Vision Chip is an intelligent image sensor which has the photo receptor and the image processing circuitry on a single chip, which can process the acquired image information by keeping its spatial parallelism. It has also the ability of implementing the very compact integrated vision system. In this paper, we describe the vision chip architecture which has the capability of detecting the line of sight from infrared eye image, with the processing speed supporting the saccade tracking. The vision chip described here has the pixel parallel processing architecture, with the node automata for each pixel as image processing. The acquired image is digitized to two flags indicating the Purkinje's image and the pupil by comparators at first. The digitized images are then shrunk, followed by several steps of expanding by node automata located at each pixel. The shrinking process is kept executed until all the pixels disappear, and the pixel disappearing at last indicates the center of the Purkinje's image and the pupil. This disappearing step is detected by the projection circuitry in pixel circuit for fast operation, and the coordinates of the center of the Purkinje's image and the pupil are generated by the simple encoders. We describe the whole architecture of this vision chip, as well as the pixel architecture. We also describe the evaluation of proposed algorithm with numerical simulation, as well as processing speed using FPGA, and improvement in resolution using column parallel architecture.
Takashi MORIMOTO Hidekazu ADACHI Osamu KIRIYAMA Tetsushi KOIDE Hans Jurgen MATTAUSCH
This letter presents a boundary-active-only (BAO) power reduction technique for cell-network-based region-growing video segmentation. The key approach is an adaptive situation-dependent power switching of each network cell, namely only cells at the boundary of currently grown regions are activated, and all the other cells are kept in low-power stand-by mode. The effectiveness of the proposed technique is experimentally confirmed with CMOS test-chips having small-scale cell networks of up to 4133 cells, where an average of only 1.7% of the cells remains active after application of the proposed approach. About 85% power reduction is thus achievable without sacrificing real-time processing.
FPGAs (Field-Programmable Gate Arrays) have been widely used as coprocessors to boost the performance of data-intensive applications [1],[2]. However, there are several challenges to further boost FPGA performance: the communication overhead between the host workstation and the FPGAs can be substantial; large-scale applications cannot fit in a single FPGA because of its limited capacity; mapping an application algorithm to FPGAs still remains a daunting job in configurable system design. To circumvent these problems, we propose in this paper the FPGA-based Hierarchical-SIMD (H-SIMD) machine with its codesign of the Pyramidal Instruction Set Architecture (PISA). PISA comprises high-level instructions implemented as FPGA functions of coarse-grain SIMD (Single-Instruction, Multiple-Data) tasks to facilitate ease of program development, code portability across different H-SIMD implementations and high performance. We assume a multi-FPGA board where each FPGA is configured as a separate SIMD machine. Multiple FPGA chips can work in unison at a higher SIMD level, if needed, controlled by the host. Additionally, by using a memory switching scheme and the high-level PISA to partition applications into coarse-grain tasks, host-FPGA communication overheads can be hidden. We enlist the two-dimensional Fast Fourier Transform (2D FFT) to test the effectiveness of H-SIMD. The test results show sustained high performance for this problem. The H-SIMD machine even outperforms a Xeon processor for this problem.
Alex K. JONES Jiang ZHENG Ahmed AMER
The performance of parallel computing applications is highly dependent on the efficiency of the underlying communication operations. While often characterized as dynamic, these communication operations frequently exhibit spatial and temporal locality as well as regularity in structure. These characteristics can be exploited to improve communication performance if the correct prediction model is selected to a suitable communication topology. In this paper we describe an entropy based methodology for quantifying and evaluating the success of different prediction models on actual workloads drawn from representative parallel benchmarks. We evaluate two different prediction criteria and combinations thereof: (1) Messages are partitioned by source node. (2) Use of a first order context model. We also describe the threshold for predication designed to largely avoid incorrect predication overheads. Our results show for simple predication models, even on highly dynamic benchmark applications, predictability can be improved by several orders of magnitude. In fact, using simple prediction techniques, over 75% of the communication volume is accurately predictable.
Kazutoshi KOBAYASHI Masao ARAMOTO Hidetoshi ONODERA
We propose a low-power resource-shared VLIW processor (RSVP) for future leaky nanometer process technologies. It consists of several single-way independent processor units (IPUs) that share parallel processor resources. Each IPU works as a variable-way VLIW processor sharing the parallel resources according to priorities of given tasks. RSVP allocates shared parallel resources to the IPUs cycle by cycle. It can minimize the number of NOPs that is wasting power. The performance per power (P3) of a 4-parallel 4-way RSVP that corresponds to four 4way VLIWs is 3.7% better than a conventional 4-parallel 4-way VLIW multiprocessor in the current 90 nm process. We estimate that the RSVP achieves 36% less leakage power and 28% better P3 in the future 25 nm process. We have fabricated an RSVP test chip that contains two IPU and a shared resource equivalent to two 2way VLIWs in a 180 nm process. It is functional at 100 MHz clock speed and its power is 130 mW.
Hyung LEE Hyeon-Koo CHO Dae-Sang YOU Jong-Won PARK
To fulfill the computing demands in visual media processing, we have been investigating a parallel processing system to improve the processing speed of the visual media related to applications from the point of view of a memory system within a single instruction multiple data (SIMD) computer. In this paper, we have introduced MAMS-PP4, which is similar to a pipelined SIMD architecture type and consists of pq processing elements (PEs) as well as a multi-access memory system (MAMS). MAMS supports simultaneous access to pq data elements within a horizontal (1 pq), a vertical (pq 1) or a block (p q) subarray with a constant interval in an arbitrary position in an M N array of data elements, where the number of memory modules, m, is a prime number greater than pq. MAMS reduces the memory access time for an SIMD computer and also improves the cost and complexity that involved in controlling the large volume of data demanded in visual media applications. PE is designed to be a two-state machine in order to utilize MAMS efficiently. MAMS-PP4 was fabricated into ASIC using TOSHIBA TC240C series library and a test board was used to measure the performance of ASIC. The test board consists of devices such as an MPC860 embedded-PCI board, two ASICs and a FPGA for the control units. Experiment was done on various computer systems in order to compare the performance of MAMS-PP4 using morphological operations as the application. MAMS-PP4 shows a respectful and consistent processing speed.